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Abstract 


Known  polylog  parallel  algorithms  for  the  solution  of  linear  systems  and 
related  problems  require  computation  of  the  characteristic  polynomial  or 
related  forms,  which  are  known  to  be  highly  unstable  in  practice.  How¬ 
ever,  matrix  factorisations  of  various  types,  bypassing  computation  of  the 
characteristic  polynomial,  are  used  extensively  in  sequentisJ  numerical  com¬ 
putations  and  are  essential  in  many  applications. 

This  paper  gives  new  parallel  methods  for  various  exact  factorisations 
of  several  classes  of  matrices.  We  assume  the  input  matrices  are  n  x  n  with 
either  integer  entries  of  sise  <  2"°'*'  or  rational  entries  expressible  as  a 
ratio  of  integers  of  sise  <  2"°*'*.  We  make  no  other  assumption  on  the 
input.  We  assume  the  arithmetic  PRAM  model  of  parallel  computation. 
Our  main  result  is  a  reduction  of  the  known  parallel  time  bounds  for  these 
factorisations  from  O(log^n)  to  O(log^n).  Our  results  are  work  efficient; 
we  match  the  best  known  work  bounds  of  parallel  algorithms  with  polylog 
time  bounds,  and  are  within  a  log  n  factor  of  the  work  bounds  for  the  best 
known  sequential  algorithms  for  the  same  problems. 

The  exact  factorisations  we  compute  for  symmetric  positive  definite  ma¬ 
trices  include:  recursive  factorisation  sequences  and  trees.  LU  factorisa¬ 
tions.  QR  factorisations,  and  reduction  into  upper  Hessenberg  form. 

The  classes  of  matrices  for  which  we  can  efficiently  compute  these  fac¬ 
torisations  include: 

1.  dense  matrices,  in  time  O(log’n)  with  processor  bound  P(n)  (the 
number  of  processors  needed  to  multiply  two  n  x  n  matrices  in  0{  log  n ) 
time), 

2.  block  diagonal  matrices,  in  time  0(log'6)  with  P{b)n/b  processors. 

3.  sparse  matrices  which  are  s(n)-separable  (recursive  factorisations  only ). 
in  time  0(log'  n)  with  P{s(n))  processors  where  s(n)  is  of  the  form 
n"'  for  0  <  7  <  1.  and 

4.  banded  matrices,  in  parallel  time  O((logn)  log  6)  with  P(b)n/b  proces¬ 
sors. 


Our  factorizations  also  provide  us  similarly  efficient  algorithms  for  e.xact 
computation  (given  arbitrary  rational  matrices  that  need  not  be  symmet¬ 
ric  positive  definite)  of  the  following;  solution  of  the  corresponding  linear 
systems,  the  determinant,  the  inverse. 

Thus  our  results  provide  the  first  known  efficient  parallel  algorithms  for 
exact  solution  of  these  matrix  problems,  that  avoids  computation  of  the 
charstcteristic  polynomisd  or  related  forms.  Instead  we  use  a  construction 
which  modifies  the  input  matrix,  which  may  initially  have  arbitrary  condi¬ 
tion,  so  as  to  have  condition  nearly  1.  and  then  applies  a  multilevel,  pipelined 
Newton  iteration,  followed  by  a  similar  multilevel,  pipelined  Hensel  Lifting. 
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1  Introduction 

1.1  Assumptions  and  Machine  Model 

For  our  model  of  computation,  we  assume  the  algebraic  Parallel  Random  Access 
Machine  (PRAM)  where  each  arithmetic  or  logical  operation  such  as  addition, 
subtraction,  multiplication,  division,  and  comparison  can  be  done  in  one  step  by 
a  given  processor.  Processors  can  execute  such  operations  in  parallel.  Our  time 
complexity  bounds  are  based  on  arithmetic  complexity,  that  is  the  number  of 
these  parallel  steps.  We  assume  the  n  x  n  matrices  input  to  our  algorithms  have 
integer  entries  of  size  <  or  rational  entries  expressible  as  ratio  of  integers 

of  size  <  The  matrices  may  be  dense,  sparse  separable,  or  banded. 

Let  P(n)  =  n***  be  the  minimum  number  of  PRAM  processors  necessary  to 
multiply  two  n  x  n  matrices  in  O(logn)  parallel  steps.  (Currently,  w  =  2.376 
and  we  assume  w  >  2.) 

1.2  Motivation 

Many  problems  in  engineering  and  science  rely  on  the  solution  of  linear  systems. 
As  the  problem  size  grows,  the  resulting  linear  systems  can  grow  to  enormous 
size,  and  can,  in  turn,  require  very  large  computational  effort  to  solve.  This 
motivates  both  the  search  for  work  efficient  and  simultaneous  parallel  time  fast 
algorithms. 

One  of  the  success  stories  in  the  field  of  computer  science  algorithms  and 
numerical  analysis  is  the  development  of  efficient  sequential  algorithms  for  rapid 
solution  of  linear  systems.  Previous  researchers  have  exploited  sparsity  and/or 
the  structure  of  the  linear  system  to  improve  these  computations.  In  practice, 
the  use  of  parallel  processing  can  give  a  further  increase  in  speed.  However,  there 
remains  a  considerable  discrepancy  between  the  theoretical  methods  proposed 
for  parallel  solution  of  linear  systems  and  the  methods  that  are  actually  used. 
This  is  the  main  motivation  of  our  paper. 

Pan  and  Reif  [PR  85. PR  89j  show  that  the  inverse  of  a  well  conditioned 
nonsingular  n  x  n  dense  integer  or  rational  matrix,  and  the  solution  of  the 
corresponding  linear  system,  can  be  approximately  computed  with  high  ac¬ 
curacy  by  Newton  iterations  in  parallel  time  O(log'n)  with  P(n)  processors, 
without  construction  of  the  characteristic  polynomial.  Subsequently  Pan  and 
others  [Pa  85, Pan  87]  showed  that  the  inverse  of  an  arbitrary  (not  necessarily 
well  conditioned)  nonsingular  n  x  n  dense  integer  or  rational  matrix  A  run  be 
exactly  computed  in  parallel  time  O(log^n)  with  P(n)  processors,  using  a  re¬ 
duction  to  the  computation  of  the  characteristic  polynomial  or  a  related  form. 
(Similar  reductions  to  can  also  be  made  for  matrices  over  an  arbitrary  fields 
[KP  91, KP  92,J  92].)  Thus  the  linear  system  Ax  =  6  can  theoretically  be  ex¬ 
actly  solved  within  this  complexity  by  the  computation  x  =  where  A“* 

is  the  matrix  inverse. 
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However,  numerical  analysis  practitioners  certainly  would  gener2Jly  not  solve 
linear  systems  by  construction  of  the  characteristic  polynomial  or  a  related  form, 
which  is  known  to  be  numerically  unstable.  Many  practicing  numerical  analysts 
would  object  to  this  approach,  for  well  known  stability  considerations. 

Instead,  they  generally  much  prefer  to  factorize  the  matrices,  and  solve  tri¬ 
angular  linear  systems.  For  example,  suppose  we  are  given  a  nonsingular  n  x  n 
matrix  .A.  If  i4  is  symmetric  positive  definite,  then  A  will  be  factored  .4  =  LU 
where  U  =  Z>^(also  known  in  this  case  as  a  Cholesky  factorization),  where  L  is 
nonsingular  lower  triangular  and  U  is  nonsingular  upper  triangular. 

The  requirement  that  the  matrix  be  symmetric  positive  definite  presents  no 
difficulties,  as  we  now  show.  Even  if  A  is  not  symmetric  positive  definite,  then 
A^A  is,  so  A^i4  can  be  factored  as  =  LU  (this  squares  the  condition 
number,  which  may  be  problem  if  A  is  badly  conditioned).  With  this  prepro¬ 
cessing,  and  given  an  n-vector  6,  the  linear  system  Ax  =  b  can  be  solved  in  two 
back-solving  stages  by  first  solving  for  n-vector  y  and  then  for  n-vector  r  in 
the  triangular  linear  systems  Ly  =  bMz  —  y  (For  which  there  are  well  known 
highly  efficient  parallel  algorithms;  see  JaJa  [J  92]).  Then  if  A  =  LU  then  we 
can  let  i  =  ;,  and  otherwise  if  A^ A  =  LU  then  we  can  let  r  =  Az. 

Known  efficient  parallel  algorithms  for  exact  computation  of  the  determinant 
of  ,4  (see  (Pa  85, Pan  87,KP  91, KP  92,J  92])  would  also  require  the  computation 
of  the  characteristic  polynomial  of  .4.  In  contrast,  given  the  LU  decomposition 
as  above,  if  /I  =s  LU,  then  det{A)  —  det{L)det{U),  and  otherwise  if  ,4^.4  =  LU , 
then  det(A)^  =  det(A^A)  =  det{L)det(U).  In  either  case,  the  determinants  of 
these  triangular  matrices  are  obtained  by  multiplying  all  the  elements  of  their 
principal  diagonals. 

Furthermore,  singular  value  decompositions  and  eigenvalue  computations 
generally  begin  with  QR  factorization  which  is  computable  from  the  LU  fac¬ 
torization.  Eigenvalue  computations  generally  use  a  further  reduction  to  upper 
Hessenberg  form  computed  from  the  QR  factorization  (see  Golub  and  van  Loan 
[GL  83]).  Thus  matrix  factorizations  of  various  types  are  used  extensively  in 
numerical  computations  and  are  essential  in  many  applications. 

For  dense  matrices,  known  efficient  parallel  algorithms  for  exact  LU  and  QR 
factorization  and  reduction  to  upper  Hes.senberg  form  ([Pan  87])  cost  O(log^n) 
time  and  also  require  the  computation  of  the  characteristic  polynomial  of  .4. 
The  objective  of  this  paper  is  to  provide  efficient  parallel  algorithms  to  factor 
matrices  in  time  O(log'n)  with  P(n)  processors,  without  computation  of  the 
characteristic  polynomial. 

1.3  Our  Parallel  Algorithms  for  Dense,  Sparse  and  Banded 
Matrices 

The  general  technique  of  approximate  solution,  followed  by  Hensel  Lifting,  has 
a  long  history  in  numerical  and  algebraic  computation.  For  a  recent  exampl>. 
see  Pan  [Pa  85, Pan  87],  We  use  such  a  construction  which  modifies  the  input 
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matrix  (which  initially  may  have  arbitrary  condition,  so  could  be  very  badly 
conditioned),  so  that  the  resulting  matrix  is  extremely  diagonally  dominant 
and  has  condition  nearly  1.  Explicit  computation  of  matrix  inverse  of  badly 
conditioned  matrices  is  thus  avoided,  and  instead  we  approximate  the  inverse 
of  diagonally  dominant  matrices.  To  compute  the  recursive  factorization,  we 
apply  a  multilevel,  pipelined  Newton  iteration,  followed  by  Hensel  Lifting.  Our 
algorithms  ^ve  an  exact  factorization,  but  nevetheless  avoid  computation  of  the 
characteristic  polynomial  or  related  forms. 

Using  reductions  to  the  recursively  factorization  algorithm,  we  exactly  com¬ 
pute,  for  symmetric  positive  definite  matrices,  LU  and  QR  factorizations,  and 
also  compute  their  reduction  into  upper  Hessenberg  form.  Using  further  re¬ 
ductions  to  the  LU  factorization,  we  exactly  compute,  for  arbitrary  integer  or 
rational  matrices  (which  need  not  be  symmetric  positive  definite)  solution  of 
the  corresponding  linear  systems,  the  determinant,  the  inverse. 

1.  For  dense  matrices,  we  show  that  ail  of  these  factorizations  and  computa¬ 
tions  can  be  done  in  parallel  time  0(log'  n)  with  P(n)  processors. 

2.  For  sparse  matrices,  (with  0(n)  non-zero  entries)  which  are  8(n)-separable 
(i.e.  the  separator  size  of  the  sparsity  graph  of  the  matrix  is  of  size  s(n)), 
we  show  LU  and  QR  factorizations  can  be  done  in  parallel  time  0(log'  n 
with  P(s(n))  processors  where  s(n)  is  of  the  form  n"’'  for  0  <  7  <  1. 

3.  For  fr-block  diagonal  matrices  (consisting  of  a  sequence  of  6  x  6  blocks  on 
the  diagonal)  we  show  LU  and  QR  factorizations  can  be  done  in  parallel 
time  O(log^6)  with  P{b)n/b  processors. 

4.  For  6-banded  matrices,  (where  the  non-zeros  occur  within  only  a  band  of 
width  6  around  the  diagonal)  we  show  LU  and  QR  factorizations  can  be 
done  in  parallel  time  0(lognlog6)  with  P(b)n/b  processors. 

Previous  work  of  Pan  [Pan  87]  gave  parallel  time  Oflog^n)  with  P(»)  pro¬ 
cessors  for  factoring  dense  symmetric  positive  definite  matrices,  including  LU 
and  QR  factorizations,  and  also  computing  their  reduction  into  upper  Hessen¬ 
berg  form.  Pan  shows  the  solution  and  determinant  of  6-banded  matrices  can 
be  computed  in  parallel  time  O(lognlog6)  with  P(b)n/b  processors,  but  his  re¬ 
sults  do  not  extend  to  LU  or  QR  factorizations.  Armon  and  Reif  [AR  92]  gave 
parallel  time  O(log^n)  with  P(s(n))‘'*'‘  processors,  for  f  >  0,  for  recursive  fac¬ 
torization  (but  not  LU  or  QR  factorizations)  of  nonsingular  symmetric  positive 
definite  matrices. 


1.4  Organization  of  the  Paper 

In  Section  1,  we  have  motivated  the  problems  we  solve  and  stated  our  results 
and  previous  results.  Section  2  is  a  preliminary.  In  this  section,  we  define 
matrix  notations,  and  problems,  as  well  as  introduce  the  Recursive  Factorization 
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(RF)  Sequence  and  Tree  of  matrices,  and  show  that  matrix  problems  can  be 
efficiently  reduced  to  RF  computation  in  0{log^n)  parallel  time  using  P(n) 
processors.  Section  3  gives  the  parallel  algorithm  to  compute  the  RF  IVee  of 
a  matrix  with  parallel  time  O(log^n)  using  P(n)  processors.  Details  of  the 
two  components  of  the  algorithm,  namely  Newton-Hensel  Lifting  and  Newton 
Iteration  are  left  out  of  Section  3  and  are  dealt  with  in  detail  in  Section  4  and 
Section  5  separately.  Section  6  extends  these  results  to  parallel  nested  dissection, 
giving  efficient  parallel  factorizations  of  symmetric  matrices  with  s(n)-separable 
graphs  in  O(log^n)  time  and  P(s(n))  processors.  Section  7  discusses  banded 
matrices. 

2  Preliminaries 

This  section  contains  some  definitions,  notations  and  previously  known  results 
that  will  be  of  use  later  in  this  paper. 

2.1  Matrix  Definitions  and  Notation 

2.1.1  Matrix  Assumptions 

All  matrix  products  are  inner  products.  Vectors  and  matrices  will  be  denoted 
by  lower  and  upper  case  characters,  respectively. 

Generally,  we  assume  input  A  is  a  square  matrix  of  size  nxn,  except  where  we 
compute  QR-factors,  where  the  input  A  can  be  a  rectangular  matrix  of  size  m  x  n 
,  where  m  >  n.  We  assume  throughout  this  paper,  without  loss  of  generality, 
that  n  is  a  power  of  2.  Although  all  our  results  apply  to  rational  matrices, 
note  that  we  can  always  multiply  a  rational  matrix  by  an  appropriate  integer 
to  form  an  integer  matrix.  For  simplicity,  we  assume  without  loss  of  generality 
throughout  this  paper  the  matrices  input  to  our  factorization  algorithms  have 
integer  entries  of  size  <  2"°*'* .  However,  our  algorithms  will  in  general  generate 
and  output  rational  matrices. 

Throughout  this  paper  all  logarithms  are  base  2. 

2.1.2  Matrix  Definitions  and  Specialized  Matrices 

I  and  O  will  denote  identity  and  null  matrices  of  the  appropriate  sizes.  .4^  will 
denote  the  transpose  of  a  matrix  A.  A  is  define  to  be  symmetric  if  A^  =  A. 
Let  det(A)  denote  the  determinant  of  A.  A  is  singular  if  det(A)  =  0,  else  it  is 
nonsingular.  If  A  is  nonsingular,  then  the  inverse  A~‘  is  defined  and  the  adjotnl 
is  the  matrix  adj(A)  =det(A)A~‘  (the  adjoint  is  an  integer  matrix  if  A  is). 

Let  A  =  [oy]  denote  the  elements  of  A.  The  principal  diagonal  of  A  are 
the  elements  an.uzz,  ■ .  -  ><>nn-  A  miner  of  A  is  a  sub-matrix  induced  by  a 
sequence  of  rows,  say  ii ,  ij , . . . ,  i„'  and  columns,  say  ii ,  j2 , .  ■ . ,  jm'  ■  A  principal 
minor  of  A  is  a  square  minor  induced  by  selecting  the  same  sequence  of  row 
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indices,  say  t'l  =  ji,  t's  =  is,  ■  ■  ■  ,tn'  =  in'  as  column  indices,  and  it  is  a  leading 
principal  minor  of  A  if  these  indices  are  consecutive  starting  at  1,  say  t'l  = 
ii  =  1,  i,  =  i2  =  2, . . . ,  in'  =  in'  =  n'.  i4  is  posHive  definite  if  Ax  >  0  for 
all  nonzero  vectors  x.  If  ,4  is  positive  definite,  every  principal  minor  is  also 
positive  definite.  If  A  is  positive  definite,  then  .4  is  always  nonsingular.  For  any 
nonsingular  matrix  A,  A^  A  is  symmetric  positive  definite. 

A  is  orthogonal  if  Ai^  A  =  I.  A  \&  lower  (upper)  triangular  if  a,;  =  0  for 
•  <  J  (J  <  *1  respectively).  .4  is  b-banded  if  Oy  =  0  for  2|j  —  i|  +  1  >  6,  so  the 
non-zeros  occur  within  only  a  band  of  width  b  around  the  diagonal.  .4  is  b-block 
diagonal  if  A  has  nonzero  entries  only  at  a  sequence  of  disjoint  6x6  blocks  on 
the  diagonal. 

We  leave  to  Subsection  2.4  a  definition  of  various  matrix  factorizations. 
2.1.3  Matrix  Norms 

Let  the  elements  of  matrix  yl  =  [a,j].  For  p  =  1,2.  oo,  let  ||/l||p  =  sup,^o 
denote  the  p-norm  of  matrix  .4,  so  ||y4(|i  =  maXj5Ij|ai;|  and  ||.4||co  =  maxi^Z^la.^l 
For  each  such  p  =  i.oo,  (see  Golub  and  van  Loan  [GL  83])  ||4||p/n  < 
max,j  |ajj|  <  |ly4||2,  for  .4  =  [a,j].  Also,  for  each  such  p  =  l,oo,  (see  Golub 
and  van  Loan  [GL  83],  p.  57)  ||i4||p/v^  <  ||A||2  <  ||A||pyn.  We  can  bound 
dcf(A)<(np||2)". 

If  i4  is  nonsingular,  let  condp{A)  =  ||A||p  •  ||A“*||p  for  p  =  1,2, oo,  and 
let  cond{A)  =  ||/l||2  ||4”‘||2.  Note  that  for  nonsingular  ,4,  cond(A)  =  ||.4||2  • 
IM'Nb  ^  so  ||y4“‘||2  >  1/||A||2.  .4  is  well  conditioned  if  cond(A)  <  n®'*’. 
Note  that  if  .4  is  well  conditioned,  then  so  is  A^ A. 

Note;  throughout  this  paper,  we  drop  the  subscript  p  in  the  case  we  wish  to 
indicate  the  '2~norm,  so  ||4||  =  ||4||2. 

2.2  Recursive  Factorization  (RF)  Sequences  of  Matrices 

Recursive  Factorization  (RF)  SEQUENCE  Problem;  If  possible,  compute  a 
sequence  of  matrices  A  =  .4o,.4i, . . . .  .4|ogn  where  for  d  =  0, 1, . . . ,  logn  -  1,  .4^ 
is  an  n/2'*  x  n/2^  matrix  which  is  partitioned  as 


where  .Y<i,  Vj,  Zj  are  matrices  each  of  size  n/2'^*^  x  n/2'‘*^  and  .4^+1  = 
Zrf  —  is  called  the  Schur  complement. 

If,  for  any  d,  no  such  factorization  is  possible,  then  .4  has  no  RF  Sequence. 
However,  any  symmetric  positive  definite  matrix  has  a  unique  RF  sequence  (see 
Golub  and  van  Loan  [GL  83]  and  Pan  (Pan  87]). 

Proposition  2.1  (follows  from  known  properties  of  Schur  complements  and 
Pan  and  Reif  [PR  85, PR  92])  In  Jie  RF  sequence,  for  all  d,0  <  d  <  logn  —  1 
and  a  6  {0, 1}**, 
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•  if  A4  is  symmetric  positive  definite,  then  so  are  Ad+i  and  Wj, 

•  l/||>lj‘||<min(|M,+i|l.||W',||). 

.  l/M<min(||.47^,||,l|W^,||-‘), 

•  cond(Wd)  <  cond(A). 

This  RF  sequence  gives  the  following  useful  recursive  formulae: 


Proposition  2.2  The  LU  factorization  of  A  can  be  recursively  computed  from 
the  RF  sequence  of  A,  as  follows: 


Ad  = 


I  O 
YdW^^  I 


Wd  O 
O  Ad^i 


/  WJ^Xd  ■ 
O  I 


Proposition  2.3  The  inverse  of  .4  can  be  recursively  computed  from  the  RF 
sequence  of  A,  as  follows: 


O  1  r  I  O' 
-YdW,-^  I  ,  • 


The  RF  sequence  provides  a  divide  and  conquer  technique  used  in  many 
theoretically  efficient  sequential  algorithms  for  matrix  inverse,  (see  [AHU  74]). 
For  exaunple,  the  Strassen  block  matrix  algorithm  shows  that  computing  the 
inverse  of  two  n  x  n  matrices  that  are  partitioned  into  four  blocks  (each  of  size 
n/2  X  n/2)  reduces  to  matrix  inverses  and  products  on  the  block  submatrices, 
(see  [AHU  74j).  The  above  formula  expresses  the  inverse  of  the  input  matrix  in 
terms  of  a  constant  number  products,  sums  and  two  inverses  of  four  n/2  x  n/2 
submatrices. 


2.3  Recursive  Factorization  (RF)  Trees  of  Matrices 

RF  TREE  Problem:  If  possible,  compute  a  full  binary  tree  of  depth  log  n  whose 
nodes  are  matrices.  All  nodes  of  depth  d,0  <  d  <  logn  are  n/2‘*  x  n/2‘*  matrices 
with  notation  Ad,a  wl»ere  a  €  {0, 1}*^  is  a  binary  string  of  length  d. 

If  d  =  0  then  Ad,a  =  A  is  the  root  of  the  tree  and  a  is  the  empty  string. 
For  0  <  d  <  log  n,  each  matrix  .4^,0  of  depth  d  has  exactly  two  children  in  the 
tree,  Ad+i,ai  and  Arf+i,ao  of  depth  d4-  I  which  will  be  defined  by  recursion.  In 
particular,  for  d  =  0, 1, . . .  ,logn  —  I  ,  Ad.a  is  an  n/2'*  x  n/2'*  matrix  which  is 
partitioned  as 

Ad, a  — 


A</+l,otO  Xd.a 
^  d,a  Xd,a 
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where  Ad+i,ao,  Xd,a,Yd,a,  are  matrices  each  of  size  71/2“*+^  x  n/2‘*'*'‘  and 
Ad+i.ai  =  Zd,a  -  Yd,aA^li  g,QXd,a  is  the  Schur  complement.  If,  for  any  d,  no 
such  factorization  is  possible,  then  A  has  no  RF  tree. 

Important  Note:  The  RF  tree  of  is  very  similar  to  the  previously  defined 
RF  sequence  A  =  i4o,i4i, ,  Aiogn-  In  particular,  for  each  d  =  l,...,logn 
Ad  =  Ad,a<  where  a  =  I**.  The  only  difference  is  that  the  RF  tree  also  recursively 
factors  each  of  the  Wd  —  matrices  appearing  in  the  RF  sequence. 

Any  symmetric  positive  definite  matrix  has  an  RF  sequence  and  therefore 
an  RF  tree,  and  it  is  unique. 

This  RF  tree  gives  the  following  useful  recursive  formulae: 

Proposition  2.4  The  LU  factorization  of  A  can  be  recursively  computed  from 
the  RF  tree  of  A,  as  follows:  Ad, a  = 

/  0]\  Ad^x.ao  O  1  [  7  A:l,^,Xd,a  ■ 

_  (,0  ^  J  [  O  J  [  O  7 

Proposition  2.5  The  inverse  of  A  can  be  recursively  computed  from  the  RF 
tree  of  A,  as  follows:  = 


7 

O 


o 


aO 


o 


'■^<<+1,01 


7 


-Yd.aA 


-1 

<1+1. aO 


o  ■ 

7 


Note:  throughout  the  rest  of  this  paper,  we  will  deal  only  with  RF  trees.  Thus 
we  will  hereafter  simply  call  an  RF  tree  an  RF,  and  we  will  call  the  RF  tree 
problem  simply  the  RF  problem  unless  otherwise  indicated. 


2.4  Reduction  of  Matrix  Problems  to  RF  Computation 

In  this  section  we  define  various  matrix  problems  and  their  efficient  parallel 
reduction  to  RF  Computation  (also,  see  Pan.  p.69  [Pan  87]).  VVe  will  consider 
a  reduction  to  be  an  efficient  parallel  reduction  if  it  can  be  done  in  Oflog"  n) 
parallel  time  using  P{n)  processors. 

1.  LU  FACTORIZATION  (if  .4  is  symmetric,  then  U  =  and  this  problem 
is  known  as  CHOLESKY  FACTORIZATION  ):  If  possible,  factor  .4  = 
LU  where  L  is  nonsingular  lower  triangular,  and  U  is  nonsingular  upper 
triangular,  otherwise  output  NO  LU  FACTORIZATION. 

If  .4  is  symmetric  positive  definite,  then  .4  always  has  a  LU  factorization. 

Note:  If  A  has  an  RF,  then  the  LU  factorization  can  be  computed  by 
0(log  n)  stages  of  matrix  multiplication  using  the  above  recursive  formula. 

Lemma  2.1  There  is  an  efficient  par- 'lei  reduction  from  LU  FACTOR¬ 
IZATION  to  the  RF  problem. 
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2.  DETERMINANT:  Compute  det(A). 

Note:  If  A  has  a  factorization  A  =  LU,  then  det(A)  =  det{L)det{U). 
The  determinant  of  a  triangular  matrix  is  obtained  by  muitiply^ig  all 
the  elements  on  its  principal  diagonal.  (This  can  be  computed  in  6>(logn) 
time  and  n/  log  n  processors  by  using  a  balanced  binary  multiplication  tree 
of  size  C)(n/logn)  whose  leaves  have  logn  elements  each.)  So  det(L)  = 
nr=i  det(U)  =  nr=i  otherwise,  A^A  is  symmetric  positive 

definite,  and  so  has  a  factorization  AJ A  =  LU.  Then  since  det(A)  = 
det{A^),  we  have  dei(A)^  =  det(AJ A)  =  det{L)det(U). 

Lemma  2.2  Then  is  an  efficient  parallel  nduciion  from  DETERMI¬ 
NANT  to  the  RF  problem. 

3.  INVERSE:  If  .4  is  nonsingular,  then  compute  .4“*  =  where  adj{.\) 

is  the  adjoint  matrix  of  A,  otherwise  output  SINGULAR. 

Note:  If  ,4  has  an  RF,  then  .4~*  can  be  computed  by  the  above  RF 
sequence  or  tree  formula.  Otherwise,  A^.4  is  symmetric  positive  definite, 
so  if  .4  is  nonsingular  then  A  has  an  RF.  Then  (A^.4)~*  =  (.4)~'(.4^)~*, 
can  be  computed  by  the  above  RF  tree  formula,  and  we  can  compute 
.4“^  =  .4((.4)~*(A^)~*)  =  .4(A^A)~*  by  one  more  matrix  product. 

Lemma  2.3  INVERSE  has  an  efficient  parallel  reduction  to  the  RF  prob¬ 
lem. 

4.  LINEAR  SYSTEM  SOLVE:  If  .4  is  nonsingular,  then  compute  .4“‘  i’.  oth¬ 
erwise  output  SINGULAR. 

Here  we  can  apply  the  efficient  parallel  reduction  given  in  our  Subsection 
1.2  to  LU  factorization  of  A^.4  (which  we  can  compute  by  Lemma  2.1  by 
efficient  parallel  reduction  to  the  RF.),  and  to  solution  of  triangular  linear 
systems,  for  which  there  are  known  efficient  parallel  algorithms  (see  JaJa 
[J  92]). 

Lemma  2.4  Then  is  an  efficient  parallel  nduction  from  LINEAR  SYS¬ 
TEM  SOLVE  to  the  RF  problem. 

5.  QR  FACTORIZATION:  If  possible,  factor  m  x  n  matrix  .4  =  QR  where 
/2  is  a  nonsingular  upper  triangular  matrix  and  Q  is  an  orthogonal  matrix 
(so  Q^Q  =  I)  of  size  m  X  n,  for  m  >  n.  If  the  QR  factorization  is  not 
possible,  then  output  NO  QR  FACTORIZATION  (Rank  deficient). 

Note:  the  Q/2-factors  of  .4  can  be  computed  from  the  LU-factors  of  .4^.4, 
where  R  =  U  and  Q  =  .  ‘  ft"  * . 
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Lemma  2.5  There  is  an  efficient  parallel  reduction  from  QR  FACTOR¬ 
IZATION  to  the  RF  problem. 

6.  HESSENBERG  REDUCTION;  Compute  a  matrix  H  =  Q^AQ  having 
upper  Hessenberg  form  Hij  =  0  if »  >  j  +  1,  and  compute  Q,  which  is  an 
orthogonal  matrix  Note;  if  .4  is  symmetric,  then  H  is  tridiagonal. 

Note;  The  Krylov  matrix  of  an  n  x  n  matrix  A  and  n-vector  t',  is  an 
n  X  m  matrix  A.'(i4,y)  =  (u,  . . . ,  Borodin  and  Munro 

([BM  75]p.l28)  describe  a  well  known  algorithm  for  the  Krylov  matrix 
requiring  2  log  m  multiplications  of  matrices  of  size  at  most  n  x  max  (n,  m). 
They  observe  that  the  matrix  powers  A,  .4*, . . . ,  42^'°*’"^  dan  bg  computed 
in  [logmj  stages  of  matrix  products,  and  that  A  (.4,v)  can  be  computed 

in  logm  further  stages,  where  using  the  identity  for  i  =  1 . [logmj, 

(j4‘'u, .4t;,  /4‘d,  . . . , «)  =  ,4"’(u,.4t;, .4"t), . . . ,  .4'’y). 

Further  Note;  If  K(A,v)  is  nonsingular  with  QR  factorization,  A'(.4.  y)  = 
QR,  then  H  =  Q^ .\Q  is  in  upper  Hessenberg  form. 

Lemma  2.6  There  is  an  efficient  parallel  reduction  from  HESSENBERG 
REDUCTION  to  the  RF  problem. 

3  Our  Parallel  Algorithm  for  Computing  an  RF 
of  a  Matrix 

Fix  a  nonsingular  n  x  n  matrix  .4,  with  integer  entries  of  size  <  2’*^ 

3.1  Deterministic  Choice  of  Modulus 

Proposition  3.1  Let  p  be  a  prime  of  size  at  least  (n||,4||)"''’  for  some  r,)  >  2. 
If  det[A)  ^  0,  then  0  ^  dei(.A)  (mod  p) 

Also,  if  det{I\{A.v))  ^  Q,  then  0  ^  det{I\{.A,  v))  (mod  p), 

3.2  Random  Choice  of  Modulus 

Proposition  3.2  (See  Schwartz  [Sc  80]  and  Pan  [Pan  87])  Let  p  be  a  random 
prime  of  the  interval  [2(n||i4(|)''’/n,2(n||.4||)'^*],  for  any  co  >  2.  If  det.(.\)  ^  0. 
then  0  ^  det(A)  (mod  p)  with  probability  >  I  —  n(n  log(7j||.4||)/(»||.4||)'‘"). 

Also,  if  det(K(A,  v))  ^  0,  then  0  ^  dct{I\{A.  y))  (mod  p)  with  probability 
>  1  -  fi(n-  log{n||.4||)/(nj(/l||f''). 
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3.3  RF  Algorithm 

We  now  describe  our  algorithm: 

Algorithm  RF 

INPUT:  sm  n  X  n  integer  matrix  A,  with  integer  entries  of  size  <  2"°*'’. 
Comment:  Since  we  can  bound  det{A)  <  all  rational  quantities  in 

the  RF  can  be  expressed  as  a  quotient  of  two  integers  of  size  <  2”°**’ . 

1.  Depending  on  whether  a  deterministic  or  randomized  algorithm  is  desired, 
do  steps  (a)  or  (b). 

(a)  Let  p  be  any  prime  from  the  interval  (njlAH)”'®  <  P  <  («||A||)"'^®  for 
some  Co, Co'  >  2. 

Comment:  By  Proposition  3.1,  if det(A)  /  0  then  0  ^  det(A)  (mod  p). 

(b)  Let  p  be  a  random  prime  from  the  interval 
2(n||A(|)‘^“/n  <  p  <  2(n||A||)‘^“,  for  some  co  >  2. 

Comment:  By  Proposition  3.2,  ifdet(.4)  ^  0  then  0  ^  det(A)  (mod  p) 
with  high  likelihood. 

2.  Let  A  =  A  +  /p(n||A(lc«)’*'' ,  for  a  sufficiently  large  constant  ci  >  8. 

Comment:  A  is  e.xtremely  diagonally  dominant,  and  thus  nonsingular, 
and  cond(A)  <  1  + 

Comment:  We  can  show  we  actually  need  to  add  a  much  smaller  quantity 
to  A,  thus  decreasing  the  bit  complexity  of  our  algorithm. 

Comment:  Since  .4  —  A  =  fp(ni|Al|oo)"*'  is  divisible  by  p,  .4  =  .4  mod  p. 
and  thus  (A)“*  =  A~^  mod  p. 

3.  Apply  Newton  Iteration  of  Section  5  (Lemma  5.4)  to  compute  in  time 
O(log' n)  using  P(n)  processors,  an  approximate  RF  of  .4  to  accuracy 
2~"  ,  for  a  sufficiently  large  constant  r  >  2. 

Comment:  This  appro.ximate  RF  gives  for  each  Aj.o,  for  d  =  0 . log  n 

and  a  €  {0, 1}“,  an  approximation  of  (.4d.a)“‘  and  the  LU  factorization 
of  Arf,o  to  accuracy  2”"\ 

4.  For  each  d  =  0, . . . ,  log  n  and  o  6  {0,  l}*^  do  in  parallel 

(a)  By  applying  Lemma  2.2  (which  showed  DETERMINANT  has  an 
efficient  parallel  reduction  to  RF),  compute  in  time  0(log‘n)  using 
P(n)  processors  an  approximation  to  det(Ad,a)  *0  accuracy  within 
2“"'  from  the  approximate  LU  factorization  of  Arf,a- 
Comment:  Since  det{Ad.a)  <  (”||'‘^ll)”'  ®nd  each  entry  in  the  ap¬ 
proximation  can  be  in  error  by  at  most  n"2~'‘  ,  the  total  error  is 
(n||A||)"n'2“'*'  <  1/2,  with  choice  of  sufficiently  large  constant  c  > 

2. 
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(b)  Round  this  approximation  to  the  nearest  integer  to  compute  det{Ad,a ) 
exactly. 

(c)  If  det{Ad,a)  =  0  (mod  p)  for  any  d,  a  then  exit  and  OUTPUT  NO 
RF  of  a! 

Comment:  In  this  case  det{Ad,a)  =  0  implying  Ad,a  is  singular,  so 
A  has  no  RF. 

(d)  Multiply  this  exact  value  of  dcf(i4d,„)  by  the  approximation  of  (.4^,0  )“* 
to  get  2U1  approximation  of  the  integer  adjoint  matrix  adj(Ad,a)  to 
accuracy  within  2~”‘. 

(e)  Round  this  approximation  to  the  nearest  integer  to  compute  adj(Ad,a ) 

exactly,  and  thus  the  exact  rational  value  of  (/!</, o)~'  =  j  ■ 

Comment:  We  now  have  the  exact  RF  of  i4. 

Comment:  Recall  A  =  A  mod  p,  so  the  RF  (mod  p)  of  .4  is  identical 
to  the  RF  (mod  p)  of  .4. 

5.  Reduce  (mod  p)  the  exact  RF  of  ^4,  yielding  the  RF  (mod  p)  of  .4 

6.  Apply  Newton-Hensel  Lifting  of  Section  4  (Lemma  4.1)  to  compute  in 
time  0(log*n)  using  P(n)  processors  the  RF  (mod  p"')  of  .4,  which  is 
the  same  as  the  exact  RF  of  .4. 

OUTPUT  The  exact  RF  of  A. 

Note  that  by  Lemmas  5.4,  2.2,  and  4.1,  each  major  stage  of  the  above  RF 
algorithm  takes  at  most  time  0(log'  n)  using  P(n)  processors.  Since  there  are 
only  7  such  major  stages,  we  have: 

Theorem  3.1  Oar  algorithm  for  the  exact  RF  lakes  parallel  time  0(log'  n) 
using  P{n)  processors.  ^ 

By  Theorem  3.1  and  the  efficient  parallel  reductions  of  Subsection  2.4,  we  have 
the  following  new  results: 

Corollary  3.1  The  following  problems  can  be  exactly  solved  in  parallel  time 
O(log'n)  using  P(n)  processors: 

•  LU  FACTORIZATION 

•  QR  FACTORIZATION. 

Note:  the  RF  will  still  be  constructed  for  input  which  arc  non-symmetric 
positive  definite  matrices,  so  symmetry  of  the  input  matrix  is  not  essential 
(it  just  simplifies  the  analysis  (Lemma  5.4)  of  the  pipelined  Newton  iterations 
described  in  Section  5).  However,  the  RF  algorithm  on  an  input  matrix  which 
is  singular  will  not  result  in  a  complete  RF. 
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By  a  slight  modification  of  the  above  construction  (see  Pan  [Pan  87]),  we  can 
compute  a  HESSENBERG  REDUCTION.  Let  H  =  [A,j]  be  the  n  x  n  matrix 
where 

if  j  —  i  =  1  mod  n, 
otherwise. 


*«  =  {i 


Let  u  =  [1,0, ...  ,0]  and  A"  =  ffi4+/fp(n||i4||oo)"*',  for  an  appropriate  constant 
Cl  >  3  and  for  a  prime  p  as  in  the  RF  algorithm. 

Pan  [Pan  87]  shows  that  the  Krylov  matrix  K{A'\  v)  =  [v,  A"v . A""~^v] 

is  strongly  diagonally  dominant,  and  that  same  proof  implies  I\{A",  c)  is  ex¬ 
tremely  diagonally  dominant.  As  Pan  [Pan  87]  notes,  some  simple  modifica¬ 
tions  of  our  RF  algorithm  allow  one  to  compute  the  HESSENBERG  REDUC¬ 
TION.  We  can  compute  (H^A")~^  using  the  RF  algorithm,  because  H^A"  = 
,4  +  Afp(n||A||oo)"‘^'  is  strongly  diagonally  dominant,  and  because  we  have  al¬ 
ready  computed  dct(//^.4")=  det(H^)det{A")=  (-ll^'^dcffA").  Compute 
(H^ A")~^det{H^ A")=  adj(H^ A").  Then  we  can  compute  the  integer  adjoint 
matrix  adj(A)  via  reduction  modulo  p  and  finally  compute  .4“‘  = 


Corollary  3.2  HESSENBERG  REDACTION  can  be  exactly  solved  in  parallel 
time  0(log"n)  using  P(n)  processors. 


Note  that  our  parallel  time  bounds  for  LU  FACTORIZATION,  QR  FAC¬ 
TORIZATION  and  HESSENBERG  REDUCTION  are  new.  Previously  Pan 
[Pan  87]  had  proved  the  same  processor  bounds  for  these  listed  problems  with 
a  time  bound  of  0(log^  n). 

The  following  additional  problems  were  previously  known  to  be  exactly  solv¬ 
able  (see  [Pa  85, Pan  87, KP  91, KP  92,J  92J)  in  time  O(log"n)  using  P{n)  pro¬ 
cessors,  requiring  reduction  to  the  computation  of  the  characteristic  polynomial. 

«  DETERMINANT 

•  INVERSE 

•  LINEAR  SYSTEM  SOLVE. 


Our  techniques  achieve  the  same  bounds,  without  the  need  to  compute  the 
characteristic  polynomial. 


3.4  Factorization  of  Block  Diagonal  Matrices 


Recall  that  A  is  b-block  diagonal  if  A  has  nonzero  entries  only  at  a  sequence  of 
6x6  disjoint  blocks  on  the  diagonal. 

CoroUary  3.3  If  .4  is  b-block  diagonal,  then  the  following  problems  can  be  ex¬ 
actly  solved  in  parallel  time  O(log*6)  with  P(b)n/b  processors: 

•  RF 


•  LU  FACTORIZATION 

•  QR  FACTORIZATION 
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4  Newton-Hensel  Lifting 

4.1  Newton-Hensel  Lifting  of  a  Matrix 

Fix  a  nonsingular  n  x  n  matrix  A  and  a  prime  p.  In  this  section  we  assume 
we  have  already  computed  A~^  mod  p.  Zassenhause  extended  Hensel's  lifting 
to  exponentially  increase  the  modulus.  The  resulting  Newton-Hensel  Lifting  is 
the  following  algorithm: 

INPUT:  a  positive  number  k,  and  n  x  n  matrices  A  and  A~^  mod  p. 

1.  =  .4"^  mod  p 

2.  For  i  =  1, . . .  do  mod  p^*. 

Moenck  and  Carter  [MC  79]  show 

Proposition  4.1  5^*^  =  .4“*  mod  p**. 

4.2  Newton-Hensel  Lifting  of  Modular  Recursive  Factor¬ 
izations 

Recall  that  the  RF  of  A  is  a  full  binary  tree  of  depth  log  n.  Each  internal  node 
is  an  n/2^  x  0/2**  matrix  Ad,a-  The  RF  of  .4^,0  is  defined  in  terms  of  the  RF  of 
two  n/2'^*^  X  0/2**+^  matrices,  namely  •4,j+i,oi  and  .4rf+i,oo,  and  furthermore 
•4<j+i,ai  is  defined  in  terms  of  submatrices  of  A^.a  and  the  inverse  of  .4rf+i,oo. 

Let  an  RF  (mod  p)  of  an  n  x  n  matrix  A  be  an  RF  of  matrix  .4  where 
each  element  is  taken  (mod  p).  This  will  also  be  called  a  modular  RF.  In 
this  section  we  assume  we  are  given  an  RF  (mod  p)  of  matrix  .4  .  VVe  shall 
compute  the  RF  (mod  p-*)  of  /I. 

Recall  that  P[n)  =  rv^  is  the  minimum  number  of  PRAM  processors  neces¬ 
sary  to  multiply  two  n  x  n  matrices  in  (?(logn)  parallel  steps,  where  we  assume 
w  >  2. 

Proposition  4.2  EdJo"  2‘‘0(P(n/2‘'))  <0(P(n)). 

Note  that  the  obvious  way  to  compute  the  RF  (mod  p"*)  of  .4  is  by  pro¬ 
ceeding  level  by  level  through  the  RF  in  logn  stages,  each  requiring  O(l‘logn) 
time  and  0(P(n))  >  2'‘0(P(n/2‘‘))  processors  for  the  required  k  matrix  prod¬ 
ucts  for  the  nodes  of  depth  d.  This  requires  total  parallel  time  0(k  log'  n)  using 
0(P{n))  processors. 

Newton-Hensel  Lifting  Algorithm  for  an  RF 

INPUT:  RF  (mod  p)  of  A  and  number  it  >  1.  , 

For  each  d, a  for  0  <  d  <  logn  and  «  €  {0, 1}"*,  do  in  parallel: 

1.  INITIALIZATION; 

(a)  Let  A^°l  =  A^.a  (mod  p) 
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(mod  p). 


(b)  Let  5^+1  oo  —  ■^d+i.oo 

2.  For  each  i  =  1, . . . ,  ib  do 

Loop  Invariant:  We  have  just  computed 
=  i4rf,a  mod  p-'~'  and 
Sd.2^^  =  Ad,l,  modp2‘"'. 


(a)  Let  5<‘i  =  (2/  -  Ad,aS^j-\ 

(b)  For  d  <  logn,  let 


•^d+l.oO 

v(*i 

^d.a 


V<‘) 

‘'(1,0 


where  ao* 1  matrices  each  of  size  n/2‘*'*’*  x 
n/2‘'+‘and' 

j(*)  _  7(0  vO) 

—  ^d,Q  “  ^d,a^d^l,QO'^d,a' 


OUTPUT:  RF  (mod  p-* )  of  A  given  by  the  {.4'^*^}. 

This  algorithm,  as  stated,  requires  parallel  time  0(iblog-’  n)  using  0(P(n)) 
processors.  However,  we  can  use  the  technique  of  stream  contracUon  (see  Pan 
and  Reif  [PR  91])  to  decrease  the  time  to  0((ib  +  logn) logn)  time  without  a 
processor  penalty.  The  idea  is  to  note  that  for  each  d, 0  <  d  <  logn,  and  each 
a  €  {0, 1}''  defining  an  internd  node  Ad,a  of  depth  d,  the  following  hold; 

1.  The  matrices  /Ij+i  ao  mod  p*'  '  and  .dd+i.oi  mod  p-’  '  are  defined  in 
terms  of  submatrices  of  Ad,a  mod  p"' . 

2.  The  computation  5^'^  =  mod  p*'  depends  on  the  previous  computa¬ 
tions 

(a)  P'”'  a"*! 

(b)  *^d+"i‘aO  =  mod  p2'". 

This  implies  that  we  can  pipeline  the  computation  as  follows;  As  our  basis 
step  t  =  0  we  have  done  the  computation  for  all  d  =  0, . . .  log  n  and  each  a  6 
{0, 1}“*.  We  assume  for  our  induction  hypothesis  that  at  time  f.  0  <  <  <  fc-f-log  n 
we  have  computed  each  Sj/l  for  all  i,d,a  where  0<d<logn,0<«<f  —  d 
and  a  €  ^0, 1}“*. 

Then  we  apply  the  RF  formula  and  compute  one  further  product  at  each 
node  of  the  recursion  tree  to  compute  by  time  t  +  1  each  '>J|*^**  for  all  i,d,  a 
where  0  <  d  <  logn,i  =  t  I  —  d  and  a  6  {0, 1}“*. 
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Summarizing,  we  have  (using  a  small  constant  factor  slowdown  to  reduce 
the  processor  bounds  from  0(P(n))  to  P(n)) 

Lemma  4.1  Given  an  RF  (mod  p)  of  an  nun  matrix  A,  we  can  compute  an 
RF  (mod  p®  )  0/  A  in  parallel  lime  0{(fc  +  logn)logn)  usin^  P(n)  processors. 

5  Newton  Iterations  for  Approximation  of  an 
RF  of  Diagonally  Dominant  Matrices 

5.1  Extremely  Diagonally  Dominant  Matrices 

Let  A  be  an  n  X  n  matrix  where  A  =  [a,j].  A  is  (-diagonally  dominant  for  some 
e  >  0  if  for  all  i,  I  <  i  <  n, 


fkil> 

Let  D(A)  ~  diag(aii, a^i,  ■ .  ■ , Onn )  denote  the  n x  n  diagonal  matrix  with  the  di¬ 
agonal  entries  an.ajSi •  •  ■  .<*nn-  Then  £)(A)”‘  =  diagll/aw.  1/022,  •  •  • .  l/unn) 
is  the  n  X  n  diagonal  matrix  with  the  diagonal  entries  l/on,  1/023,  ■  ■ . .  1/0 

nn* 

Since  1|/  -  D(A)“‘ A||oo  <  max,  >1  follows: 

Proposition  5.1  If  A  is  (-diagonally  dominant  then  ||/  —  D(A)”*A||c«  <  (. 

We  generally  will  let  =  Z?(A)~*  be  the  initial  approximation  to  the  inverse 
of  A  in  our  Newton  iterations.  A  is  diagonally  dominant  if  A  is  (-diagonally 
dominant  for  (  <  1.  A  is  strongly  diagonally  dominant  if  A  is  (-diagonally 
dominant  for  e  =  1  —  l/n'  for  some  c  >  0.  Define  a  quantity  to  be  extremely 
small  if  it  is  of  the  form  (  =  l/(n(|At|)'^  for  some  large  constant  r  >  2.  (This 
definition  of  extremely  small  is  required  for  our  proofs  of  rapid  convergence; 
but  in  practice,  it  is  possible  we  could  alter  this  definition  of  extremely  small 
so  that  (  is  a  larger  quantity,  thus  often  decreasing  the  bit  complexity  of  our 
algorithm.)  .-I  is  extremely  diagonally  dominant  if  A  is  e-diagonally  dominant 
for  an  extremely  small  (.  Let  a  nonsinguiar  .-1  be  extremely  well  conditioned  if 
condA  <  1  e  for  an  extremely  small  (.  Let  A  «  A  if  ||/l  -  .4||  is  extremely 
small.  The  following  follows  from  basic  norm  proprieties  (see  also  [K  81, K  87]): 

Proposition  5.2  Suppose  A  is  extremely  diagonally  dominant.  Then: 

•  A  is  extremely  well  conditioned, 

•  If  Ass  A  then  A  is  also  extremely  diagonally  dominant, 

•  A  %  D(A),  so  A  is,  within  extremely  small  relative  error,  a  diagonal 
matrix  (but  of  course  A  is  not  identical  to  D{A)). 
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•  sss  D{A)~^,  so  the  inverse  of  A  is,  within  extremely  small  relative  er¬ 
ror,  also  a  diagonal  matrix  (but  of  course  is  not  identical  to  D(A)~^). 

•  If\\I  —  fli4||oo  ts  extremely  small,  and  (|.-l||oo  >  1.  then  ||fl||«.  <  2. 

5.2  Approximate  Inverse  of  Diagonally  Dominant  Matri¬ 
ces 

Let  .4  be  an  n  X  n  matrix.  We  will  let  =  D(.4)~'  be  the  initial  approxi¬ 
mation  to  the  inverse  of  .4  in  our  Newton  iterations.  The  Newton  iteration  (see 
Ben-Israel  [B66],  Ben-Israel  and  Cohen  [BC  66].  and  Hotelling  [H  43. Ht  4.3]  for 
sequential  Newton  iteration  and  Pan  and  Reif  [PR  8.3, PR  89]  for  parallel  appli¬ 
cations)  generates  a  sequence  of  matrices  R'"’. . . .  where 

g{k)  _  B(*-U(2/  _  .4fl<*-‘>). 

Proposition  5.3  If  A  is  (-diagonally  dominant  then  ||/  —  S'*'.4||o.,.  <  i  -  . 
This  follows  since. 

I  -  S'*-'’.!  =  /-  S“’-"(2/  -  .1R'*~*’).1 
=  (/-  B'*-‘’.4)v 

Thus 

IK  -  fl‘*'.4|U  =  ||(/-  S“-".4)-||,,  <  (/-*■')-  = 

Thus  if  .1  is  strongly  or  extremely  diagonally  dominant,  (hen  k  =  0(log;i) 
Newton  iterations  suffice  to  compute  the  approximate  inverse  S**'  with  error 


5.3  Newton  Iterations  with  Inaccurate  Input 

We  assume  .4  is  extremely  <liagonally  dominant,  so  .1  is  f-<liagonally  dominant 
for  some  extremely  small  f  =  l/(»i||.4||)^"  for  some  snIRriently  large  c  >  S. 
Let  Et(()  =  f-* -3'* .Note  that  En{<)  =  '■  1 (O)"  <  -'md  .d.so 

£j.  iog„(f)  =  for  a  sufficiently  large  constant  r'  >  1. 

Next  we  consider  the  case  of  Newton  iteration  where  tlie  input  matrix  .1 
is  not  initially  given  with  full  accuracy,  and  instead  we  are  provided  on-line 

a  sequence  of  approximates  T‘”*..4**’ -  where  ||.4  —  .4'*’||t>j  <  L\.(< )  Let 

=  di<i!7(  1 /dll.  1/022. - 1/dnn)  where  .T"'  =  [fi.j]. 

If  /i'”’  is  c-diagonally  dominant  then  by  Proposition  .3.1.  |K-  R''”.4'"’||>j  < 

f.  We  will  generate  a  sequence  of  matrices  R*  *’ _ where  R*^  ’  =  R**  “'  *(2/  — 

4(t)g(*-i)) 

Since  .4  is  assumed  to  be  extremely  diagonally  dominant,  then  by  as.snmption 
on  the  close  approximation  of  .4  Ity  the  .4**'’,  by  Proposition  .3.2  we  have: 
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Proposition  5.4  Each  is  also  extremely  diagonally  dominant,  and  ez~ 
tremely  well  conditioned. 

Finally,  we  give  an  inductive  proof  of  quadratic  convergence  of  the  entire 
iteration  sequence: 

Proposition  5.5  If  ||i4||oo  >  1,  -A  and  are  extremely  t-diagonally  domi¬ 
nant,  then  we  have  ||/  —  <  Et{€). 

Proof:  We  use  a  proof  by  induction  on  k.  Note  that  the  Basis  holds  by  as¬ 
sumption,  since  e  =  Now  for  k  >  1,  assume  the  induction  hypothesis: 

By  definition  of  the  Newton  Iterations, 

/  - 

=  (/- 

By  assumption  on  the  approximations  to  the  .4***, 

IM<*'  -  <  \\A  -  .4<‘'||oo  +  \\A  -  .4'‘-‘>||co  <  EUc)  + 

Since  by  Proposition  5.4,  .4**~‘*  is  extremely  diagonally  dominant  and  by 
the  induction  hypothesis,  ||/  -  fl'*“**.4<‘'~‘>||co  <  Bt-ift),  and  by  assumption 
Mlloo  >  li  Proposition  5.2  implies  ||S**“‘*|jco  <  2. 

Also, 

I  —  =  (  /  —  —  ,4^*’), 

so  we  have  the  bound: 

11/  _  <  ||/  _  ||5“-"|U  •  -  -4“’|U 

<  £:*_i(f)  4-  2(E*(0  4-  (f) 

since  Ek{c)  <  Efc-ife). 

Thus  we  have: 

11/  -  B<*'.4<‘'|U  =  ll(/  -  B<*-'’.4<*')-|U 

<  ||(/  -  <  (5Et_,(e))=  <  £t(e), 

by  definition  of  Et(c).  | 

Lemma  5.1  If  A  is  extremely  diagonally  dominant,  then  even  with  the  above 
approximations  to  A,  k  =  O(logn)  Newton  iterations  still  suffice  to  compute  the 
approximate  inverse  B**’  with  error  2"'* 

(Note:  we  ca.i  also  show  that  Lemma  5.1  holds  even  if  .4  is  more  moderately 
diagonally  dominant,  but  the  Lemma  will  suffice  for  the  RF  algorithm.) 
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5.4  Non-Pipelined  Newton  Iterations  for  RF 

In  this  subsection  we  assume  we  are  given  a  matrix  A  which  is  extremely  diag¬ 
onally  dominant.  We  shall  compute  an  approximation  to  the  RF  of  A  within 
error  2“"‘  for  any  given  constant  c>  1. 

Recall  again  that  the  RF  of  .<4  is  a  full  binary  tree  of  depth  log  n.  Each 
internal  node  is  an  n/2^  x  n/2^  matrix  A4,a-  The  RF  of  Ad,a  is  defined  in  terms 
of  the  RF  of  two  n/2^'*'^  x  n/2‘*'*'‘  matrices,  namely  i4<i+i,ai  snd  .4^+1, ao.  and 
furthermore  i4,<.(.i,ai  is  defined  in  terms  of  submatrices  of  Ad,a  and  the  inverse 
of  i4^+l,a0- 

Fix  some  k  —  ci  log  n,  for  some  appropriate  constant  ci  >  2.  Note  that  the 
simplest  and  most  obvious  way  we  could  use  to  compute  the  RF  of  A  within  error 
2~"°  is  by  proceeding  up  the  recursion  tree  in  log  n  stages.  We  first  sketch  this 
simple  algorithm  (we  will  give  full  details  of  a  more  efficient  algorithm  below). 
At  each  stage  d  =  0, 1, . . .  ,logn,  we  could  compute  for  all  nodes  of  depth  d  an 
approximation  A<j,q  of  Ad,a  within  error  2“"'' .  Then  we  could  compute 
which  is  a  ib-th  Newton  iteration  matrix  approximating  the  inverse  of  matrix 
Ad,a  up  to  error  €'  ,  where  t  =  l/(n(|A||)'  ",  for  some  sufficiently  large  constant 
d  >2.  We  could  use  to  approximate  within  error  2“"'' .  Since  there 
are  only  logn  levels  in  the  RF,  and  since  ,4  is  extremely  diagonally  dominant, 
the  norm  bounds  given  in  Proposition  2.1  would  ensure  that  the  error  of  the 
overall  approximate  computation  is  at  most  2*"'.  At  each  stage  we  would  need 
to  compute  approximations  to  two  recursive  inverses.  Each  such  stage  requires 
0(log*n)  time  and  0(P(n))  >  2‘^0(P(n/2'^))  processors  for  the  required  k 
repeated  n/2‘*  x  n/2‘*  matrix  products  for  each  of  the  2**  nodes  of  depth  d. 
The  logn  stages  would  require  total  parallel  time  0(/blog'  n)  =  O(log^n)  using 
Eltfo"2‘'0(/"(n/2<'))  <  0{P{n))  processors. 


5.5  Pipelined  Newton  Iterations  for  RF 

Next,  we  will  improve  on  this  0{log^n)  algorithm  for  *he  appro.ximate  RF. 
decreasing  the  time  to  0(log~n)  without  an  increase  in  processor  bounds.  As 
in  Lemma  .5.1.  we  consider  the  case  of  Newton  iteration  where  the  recursively 
defined  matrices  Ad,a  not  initially  given  with  full  accuracy,  and  instead 
we  are  provided  a  sequence  of  improving  approximations  ,  .4 , . . .  where 
|[A  —  A^*’||oo  <  Pk(i)-  In  the  special  case  d  =  0,  the  approximation  is  exact, 
Ao,<>  =  A,  where  <>  is  the  empty  string. 

Given  these  approximants  to  .4,  we  will  generate  a  sequence  of  matrices 


=  Bj*“‘'(2/  —  Thus,  will  approximate  the  inverse 


(t) 


of  A 


i(*) 


d,»- 


We  will  then  use  Lemma  5.1  to  show  quadratic  convergence. 
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Newton  Iteration  Algorithm  for  an  RF 

INPUT:  An  extremely  e-diagonally  dominant  matrix  .4  of  size  n  x  n. 

For  each  d,a  for  0  <  d  <  iogn  and  a  6  {0, 1}**,  do  in  parallel: 

1.  INITIALIZATION:  Let  =  .4,  where  <>  is  the  empty  string. 

2.  Let  k  =  Cl  log  n,  for  some  appropriate  constant  ci  >  2. 

3.  For  each  i  =  1, ....  i  do 

Loop  Invariant:  and 

•  We  have  just  computed  -4^"**  which  approximates  Aj  a  with  error 

-  AhAco  <  Ei.iit')  for  c'  =  l/(nl|A||)''"  with  c'  >  2.  and 

•  we  have  just  computed  Sjo'*  which  approximates  .4J^  with  error 
specified  by  ||/-  fly~‘'A<<.o||oo  <  fi'.-ilf')- 

(a)  If  i  =  0  then  let  =  diag(l/aii,  l/aoj, - l/«nn).  where  .4j°^  = 

Else  if  .■  >  0  then  let  (2/  -  .4';), fl‘ 

(b)  For  d  <  logn,  let 


L  ^d.o  J 

where  y]‘J,  are  matrices  each  of  size  n/2'*'*'*  x 

n/2‘'+‘  and 

It*)  _  Y'(*)ra(0  y-(>) 

“  ^d.Q  “  ’  d,a^d^l.aO'^d,o’ 

OUTPUT:  Approximate  RF  {aJ*^},  which  approximates  the  RF  of  .4 
within  2-norm  error  ^'712"'*"^'  <  2“"  ,  for  some  c  >  1 

We  now  show  that  in  the  initial  approximation  to  the  IlF  of  .4. 

the  errors  do  not  accumulate  as  much  on  the  recursions.  Recall  that  we  have 
recursively  defined:  A<<+i,oti  =  Z<i,a  —  Id.aA^+i  T*'*®  implies: 

Proposition  5.6  (agatn  follows  from  known  properties  of  Schur  complements 
and  Proposition  2.1  Pan  and  Retf  [PR  8.5, PR  92])  In  the  RF.  for  all  d.D  <  d  < 
logn  —  1  and  or  €  {0. 1}"^, 

•  if  the  Ad.a  M  symmetric  positive  definite,  then  so  are  Aj+i  ai,  Aj+i.ao, 

•  ||Ad+i,oi||,||Ad+i,.ol|  <  l|Ad,ol|, 
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•  l/II^J,all  <  min(||/ld+i,ai||,||^d+l.ao||), 

•  l/lMd.all  <  min(||ylj^\  „ol|), 

•  cond{Ad, ao)  <  cond{A)  . 

Since  A  is  extremely  diagonally  dominant,  by  definition  A  is  e-diagonally 
dominant  for  e  =  l/(n||i4||)*"  for  some  c>  2.  By  Proposition  5.2,  ,4  is  extremely 
well  conditioned.  Since  we  have  cond(Ajao)  <  cond(yi),  each  .djoo  is  also 
extremely  well  conditioned. 

Recall  we  have  defined  a  quantity  to  be  exinmely  small  if  it  is  of  the  form 
f'  =  l/(n((i4||)® "  for  some  c'  >  2.  By  Proposition  5.2,  .4  and  ail  the  matrices 
Ad,a  io  the  RF  of  A  are,  at  least  within  extremely  small  relative  error,  diagonal 
matrices.  By  Proposition  5.2,  the  inverse  of  such  nearly  diagonal  matrices  are 
also,  with  extremely  small  relative  error,  diagonal  matrices.  Since  our  initial 
approximations  to  the  inverses  of  matrices  occurring  in  the  RF  are  themselves 
diagonal  matrices,  it  follows  by  Proposition  5.2,  that  our  initial  approximations 
to  its  inverse  gives  an  approximation  to  the  RF.  with  extremely  small  relative 
error  f'  =  l/(n||i4||)‘^  ”  for  some  c'  >  2. 

Lemma  5.2  Let  e'  —  l/(n(|/l||)' ”  for  some  c'  >  2.  In  the  Newton  Iteration 
Algorithm  for  the  RF,  at  each  depth  d  ~  1, . . . ,  log  n.  and  for  each  a  6  {0, 1)“*.  on 
the  first  stage  of  Newton  iteration,  the  initial  approximation  of  the  RF  matrices 
at  depth  d,  have  extremely  small  no-norm  error.  In  particular. 

ii<:--4,,,iioo<c' 

and 

IK  -  <  f'. 

Lemma  5. .3  bounds  the  errors  of  the  i-th  approximate  RF  {.4|j'^}.  It  follows 
by  induction  directly  from  Lemma  -5.1  using  it  as  a  basis  for  the  induction 
Lemma  5.2. 

Lemma  5.3  For  each  d  =  I, . . . ,  log n.  and  each  a  G  {0, 1  )'^  the  nc-norm  error 
for  the  i-th  iteration  is  a  (t')*  .  In  particular, 

ll^L  -  <  Eile') 

and 

IK  -  <  Eiie'). 

As  in  the  Subsection  4.2,  we  use  the  technique  of  stream  contraction  to 
decrease  the  time  to  0(log*n)  time  without  a  processor  penalty.  To  simplify 
notation,  let  us  define  „  to  be  the  computation  of  aJ/I,  and  Bj^,  where  this 
approximate  computation  is  within  error  specified  by  Lemma  5.3. 
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The  idea  is  to  note  that  for  each  d,0  <  d  <  logn,  and  each  a  €  {0, 1}*^ 
defining  an  internal  node  Ad^a  of  depth  d,  the  computation  ^  depends  only  on 
the  computations  and  ^o-  Furthermore,  the  computation  <,1 

is  defined  in  terms  of  submatrices  of  and  of  In  particular, 

we  need  only  to  apply  Newton  iteration  one  more  time  to  these  approximated 
matrices. 

This  implies  that  we  can  pipeline  the  computation  of  ^  just  as  we  did  in 
the  previous  section  for  5^^  (except  we  use  here  k  =  O(logn))  to  reduce  the 
parallel  time  from  Q(log^  n)  to  0(log' n)  without  a  processor  increase. 

We  pipeline  the  computation  as  follows;  As  our  basis  step  t  =  0  we  have 
done  the  computation  4^^  for  all  d  =  0, ...logn  and  each  a  6  {0, 1}*^.  We 
assume  for  our  induction  hypothesis  that  at  time  1,0  <  t  <  t  +  log n  we  have 
done  the  computation  „  for  all  i,  d,  o  where  0  <  d  <  log  n,  0  <  i  <  1  -  d,  and 
«€  {0, 1}**. 

Then  we  apply  the  approximation  RF  formula  and  perform  one  more  product 
at  each  node  of  the  recursion  tree  to  do  computation  by  time  <  +  1  for  all 
i,  d,  o  where  0  <  d  <  log  n  and  i  =  <  +  1  -  d  and  a  €  {0, 1}“^. 

Summarizing,  we  have  (using  a  small  constant  factor  slowdown  to  reduce 
the  processor  bounds  from  0(P(n))  to  P(n)) 

Lemma  5.4  Given  annxn  matrix  A,  which  ts  extremely  diagonally  dominant, 
we  can  compute  an  approximation  to  (he  RF  of  A  with  error  2“”  "  in  parallel 
time  O(log'n)  using  P(n)  processors. 

6  Symmetric  Matrices  with  Separable  Graphs 

A  family  of  graphs  is  s{n)-separable  if,  given  a  graph  G  in  the  family  of  n  >  C)(  1 ) 
nodes,  we  can  delete  a  set  of  s(n)  nodes,  separating  G  into  subgraphs  in  the 
family  of  size  <  2/3n  nodes.  Clearly  d-dimensional  grids  or  dissection  graphs  are 
s(n)  =  0(n"5“ )  separable,  and  Lipton  and  Tarjan  [LT  79]  showed  planar  graphs 
are  0(v^)-separable.  A  sparsity  graph  of  a  symmetric  matrix  has  a  vertex  for 
every  row  (column)  of  the  matrix  and  an  edge  wherever  there  is  a  non-zero  entry 
of  the  adjacency  matrix.  Matrices  with  separable  sparsity  graphs  arise  naturally 
from  VLSI  circuit  problems,  structure  problems,  and  discretization  of  2-  or  .3- 
dimensional  PDEs.  For  example,  d-dimensional  PDEs  result  in  matrices  whose 
sparsity  graphs  are  d-dimensional  grids  or  related  dissection  graphs  which  are 
s(n)  =  0(n“7“)  separable. 

Let  A  be  an  n  X  n  symmetric  positive  definite  matrix  A  with  a  s(n  )-separable 
sparsity  graph.  Lipton,  Rose,  and  Tarjan  [LRT79j  and  Pan  and  Reif  [PR  8.3, 
PR  92]  define  a  nested  dissection  ordering  which  is  used  to  guide  the  Gaussian 
elimination  process  of  the  sparse  mat'*x  so  as  to  minimize  fill-in.  We  assume 
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the  matrix  has  already  been  pre  and  post  multiplied  by  a  permutation  matrix 
so  that  the  resulting  matrix  A  has  the  rows  and  columns  in  this  order. 

Using  this  ordering  for  sequential  Gaussian  elimination, 

Lemnui  6.1  (Lipton,  Rose,  and  Tarjan  [LRT79])  Given  an  n  x  n  symmetric 
positive  definite  matrix  .4  with  an  8(n)- separable  sparsity  graph,  a  recursive  LU 
factorization  can  be  computed  in  sequential  time  0(s(n)^),  exactly  solving  in 
the  same  time  bound  the  problems  DETERMINANT  and  LINEAR  SYSTEM 
SOLVE. 


In  the  Parallel  Nested  Dissection  algorithm  of  Pan  and  Reif  [PR  85, PR  92), 
this  nested  dissection  ordering  is  specified  by  a  vertex  partition  sequence  V'o,  Vi , . . . ,  14 , 
for  k  =  O(logn)  which  is  used  to  guide  the  parallel  elimination  process.  Let 
no  =  n  and  let  n^+i  =  n^  -  |V<i|  for  d  =  0, . . .  ,ib.  Assuming  an  s(n)-separable 
sparsity  graph  (see  Lipton,  Rose,  and  Tarjan  [LRT79]),  we  define: 

Nested  Dia8ection(ND)  RF  Sequence  Problem: 

If  possible,  construct  a  sequence  of  matrices  A  =  Ao,  .Ai,  Ao, . At,  where 

k  =  O(logn)  and  for  d  =  0, 1 . —  1,  Aj  is  an  n^  x  matrix  which  is 

partitioned  as 


.A<,= 


Wj  Xj  ■ 


where 


•  tid+i  =  n<i  -  |V<i| 

•  iVd,Xj,Vd,Zd  are  matrices, 

•  Wd  is  a  block  diagonal  matrix  of  size  \Vd\  x  jV<i|  where  each  block  is  of  size 
s(n<i)  X  s(nd), 

•  Xi  is  of  size  |Vj|  x  nj+i, 

•  Yd  is  of  size  nj+i  x  \Vd\  (if  A  is  symmetric,  then  Yd  =  (A4)^), 

•  Zd  is  of  size  nj+i  x  nj+i,  and 

•  Ad+i  ■=  Zd  —  YdWj'^Xd  is  the  Schur  complement. 

Note  that  Pan  and  Reif  [PR  85, PR  92]  show  that  the  norm  and  condition 
bounds  of  Proposition  2.1  hold  for  any  ND  RF  sequence. 

Pan  and  Reif  [PR  85, PR  92]  show  that  an  ND  RF  sequence  can  be  computed 
in  parallel  time  0(log^  n)  using  P{s{n))  processors,  also  giving  in  the  same  time 
bounds  solutions  to  the  problems  DETERMINANT  and  LINEAR  SYSTEM 
SOLVE.  The  proofs  in  Pan  and  Reif  [PR  85, PR  92]  show  that  the  work  for 
ND  RF  sequence  is  dom'nated  by  the  computation  of  0(n/s(nj))  products  and 
inverses  of  a  sequence  of  dense  submatrices  of  size  s{nd)  x  s(nd)  each  requiring 
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O(log"n)  time  and  0(P(s(nd))n/s(n<j))  processors  for  d  =  0, . . . ,  it  =  O(logn). 
Thus  the  total  time  is  O(log^n)  and  the  processor  bound  is 


J]0(P(s(n,))n/s(nd))  <  0(P(s(n))), 
d=0 

where  s(n)  is  of  the  form  n'*'  for  7  >  0. 

We  now  derive  a  parallel  algorithm  in  this  case,  and  reduce  the  parallel 
time  from  fi(log^  n)  to  0(log"  n)  while  still  using  P(s(n))  processors.  We  use  a 
modified  form  of  our  (balanced)  RF  algorithm  on  the  appropriately  permuted 
input  matrix. 

The  partitioning  of  blocks  is  again  altered  depending  on  the  separator  struc¬ 
ture,  following  the  usual  techniques  used  in  the  above  ND  RF  sequence.  Again 
the  nested  dissection  ordering  is  specified  by  the  vertex  partition  sequence 
Vo,V2, . . .  ,Vie  ,  (or  k  =  O(logn)  which  is  used  to  guide  the  parallel  elimina¬ 
tion  process.  Let  rid  be  as  defined  above. 

ND  RF :  If  possible,  compute  a  binary  tree  of  depth  k  =  0(log  n)  whose  nodes 
ate  matrices.  Each  node  of  depth  d,  0  <  d  <  fr  is  a  a  ^  matri.x  .-Irf  a 
where  a  €  {0, 1}*^  is  a  binary  string  of  length  d,  and  nj  a  is  defined  recursively 
below.  The  node  is  a  leaf  if  nd.a  =  I- 

For  each  d,  0  <  d  <  <;,  we  specify  the  string  1'*  to  be  special.  For  each 
a  €  {0,1}“^,  if  a  is  special  and  nj  a  >  1.  then  we  recursively  decompose  the 
matrix  (using  the  ND  RF  sequence),  .setting  nj+i  co  =  |Vi|  and  Urf+i  ,j,i  = 

"d+i  —  —  \^'d\-  Otherwise,  if  o  €  {0,1}'^  is  not  special  but  nj  „  >  1.  then 

we  recursively  decompose  the  matrix  evenly  (using  the  RF  defined  at  the  start 
of  this  paper).  Let  nd+i.ai  =  L«d.a/2J-  «d+i.ao  =  [”<<,0/21  •  If  d  =  0  then 

■‘^d.a  =  is  the  root  of  the  tree  and  n  is  the  empty  string.  For  0  <  d  <  ^.  each 

matrix  Ad, a  of  depth  d  with  <,  >  I  has  e.xactly  two  children  in  the  tree,  .4^  <,1 
and  Ad  ao  of  depth  d  -I-  1  which  will  be  defined  by  recursion.  In  particular,  for 
d  =  0,  I . —  1  ,  Ad.a  is  an  Hd.a  x  Ud.o  matrix  which  is  partitioned  as 


Ad. a  — 


•■Id+l.oO 
^  d.a 


d.a 

Zd.a 


where 

•  .4d.4" . ou t  ^^<<,0 )  )<<.o»^f<,o  are  matrices, 

•  .4<<+i,a0  =  ^'d  is  of  size  nd+i,„o  x  nj+iao.  (where  if  o  =  1'^  then  nd+i,„o  = 
IV'dl  and  .4d+i,ao  is  a  block  diagonal  matrix  of  size  |kj|  x  |Vd|  with  each 
block  of  size  s(nd)  x  s(Md)), 

®  ^d.Q  ^s  of  size  nd.^i,,jtu  x 

•  Yd,a  is  of  size  rid+i.ai  x  rjd+i.ao  (if -4  is  symmetric,  then  Yd,a  =  (A'd.o)^),, 


®  is  of  size  ^  ^d+i,otii  ^nd 

•  ^d+i,ai  =  Zd.a  -  Vd,a^d+i.oO'^‘<.<»  “  Schur  Complement. 

Note:  The  above  ND  RF  of  A  is  defined  to  be  very  similar  to  the  RF 
.4  =  i4o,v4i,/l2, , . .  In  particular,  Aj  =  Aji*  for  d  =  The  only 

difference  is  that  the  ND  RF  also  recursively  factors  the  matrices 

appearing  in  the  ND  RF  sequence.  This  takes  0{log'n)  time  using  P(s(n)) 
processors.  Since  this  is  an  s(n,f)-block  diagonal  matrix  of  size  \Vd\  x  \Vd\  with 
each  block  of  size  s(n,j)  x  s{nd)),  the  processor  bound  is  0(P(s(  «<<))«/«(  nj))  < 
P(s(n)).  These  are  recursively  factored  evenly  (rather  thjui  use  the  separator 
structure),  using  the  (balanced)  RF  defined  at  the  start  of  this  paper. 

Note  that  the  results  of  Pan  and  Reif  [PR  85,PR  92]  imply  that  the  norm 
and  condition  bounds  of  proposition  5.6  hold  also  for  any  ND  RF. 

We  again  use  the  exact  same  pipelining  technique  used  in  Subsection  4.2  to 
decrease  the  parallel  time  bounds  from  O(log^n)  to  Oflog'n).  We  use  both 
the  analysis  of  (balanced)  RF  defined  at  the  start  of  this  paper,  as  well  as 
the  analysis  of  ND  RF  sequence  defined  in  Pan  and  Reif  [PR  85, PR  92].  In 
particular,  the  proof  of  our  P(s(n))  processor  bounds  follows  e.xactly  from  the 
results  of  Pan  and  Reif  [PR  85. PR  92]  and  from  corollary  3.3  in  this  paper. 
Again  in  this  sparse  s(n)-separable  case  the  work  for  ND  RF  is  dominated  by 
the  computation  of  products  and  inverses  of  a  sequence  of  dense  submatrices  of 
size  s(n<<)  x  s(n<i)  each  requiring  0(log‘ n)  time  and  0(P{nd))  processors  for 

d  =  0 . 1:  =  0(log  n).  However,  due  to  our  use  of  pipelining,  all  these  inverses 

are  computed  simultaneously,  so  the  total  time  is  O(log*  n)  while  the  processor 
bound  remains  5Id=o  0{P{ii{nd)))-  This  sum  is  0(P(s(n)))  if  s(n)  is  of  the  form 
n’’'  for  0  <  7  <  1.  Note  that  if  the  sparsity  graph  of  .4  has  constant  degree  or  is 
planar,  then  the  sparsity  graph  of  ,4^.4  still  has  separator  bound  0(s(n)). 

Theorem  6.1  Let  A  be  an  n  x  n  symmetric  positive  definite  matrix  with  an 
s(n)-separable  sparsity  graph,  where  s(n)  is  of  the  form  n'^  forO  <  7  <  1.  If  A  is 
nonsingular,  then  an  ND  RF  can  be  computed  in  parallel  time  0(log'  n)  time  us¬ 
ing  P(s{n))  processors,  and  we  can  exactly  solve  the  problems  DETERMIN.ANT 
and  LINEAR  SYSTEM  SOLVE  for  A  within  the  same  parallel  complexity. 

7  Banded  Matrices 

Recall  that  A  is  b-banded  if  0,7  =  0  for  2|j  —  i|  +  1  >6.  so  the  non-zero  entries 
occur  only  within  a  band  of  width  6  around  the  diagonal.  Note  that  the  sparsity 
graph  of  A  has  constant  separator  bound  b,  and  also  note  that  the  sparsity  graph 
of  A^ A  also  has  constant  separator  bound  36.  If  s{n)  is  a  constant  6,  then  the 
above  sum  of  proof  of  Theorem  6.1  bounding  the  processors  0(P{s{nd))) 
is  upper  bounded  by  0(P{b)n/b). 
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Corollary  7.1  Let  A  be  an  n  x  n  matrix  which  is  h-banded.  If  A  is  symmet¬ 
ric  positive  definite,  then  an  ND  RF  can  be  exactly  computed  in  parallel  time 
0(iognlog6)  with  P(b)n/b  processors.  We  can  solve  the  problems  DETERMI¬ 
NANT  and  also  LINEAR  SYSTEM  SOLVE  for  nonsingular  A  (not  necessarily 
symmetric  positive  definite)  within  the  same  parallel  complexity,  by  computing 
the  RF  of  AT  A,  using  the  reductions  of  Lemma  2.2  and  2.f. 


8  Further  Work 

In  a  further  paper  [R  93b],  we  show  that  many  structured  linear  systems  can 
be  solved  exactly  and  efficiently  in  parallel,  dropping  processor  bounds  io  lin¬ 
ear,  with  polylog  time  bounds.  We  give  much  improved  parallel  algorithms  for 
the  exact  solution  and  factorization,  determinant,  inverse,  2md  finding  rank  of 
various  structured  matrices;  in  particular  Toeplitz  and  matrices  of  bounded 
displacement  rank,  and  their  generalizations.  These  are  the  first  polylog  time 
bounds  for  these  problems  with  linear  processors.  Our  processor  reduction  is  by 
far  the  major  result  of  that  subsequent  paper;  this  processor  reduction  uses  tech¬ 
niques  specific  to  structured  matrices  (and  not  related  to  the  enclosed  paper) 
However,  using  pipelining  techniques  of  the  enclosed  paper,  we  also  decrease 
our  time  bounds,  for  solving  these  structured  linear  systems,  to  0(log'  n)  using 
n(logn)"  processors.  In  spite  of  this,  we  view  this  speedv*'  due  to  pipelining 
as  less  consequential  compared  to  our  processor  decrease  r  structured  linear 
systems.  We  apply  this  result  to  efficient  parallel  algorithms  for  the  following 
problems  in  the  same  parallel  time  and  processor  bounc  .  Dolynomial  great¬ 
est  common  divisors  (GCD)  and  e.xtended  GCD,  polynon  ,al  resultant.  Fade 
approximants  of  rational  functions  ,  and  shift  register  synthesis  and  BCH  de¬ 
coding  problems.  This  drops  by  a  nearly  linear  factor  the  best  previous  processor 
bounds  for  polylog  time  parallel  algorithms  for  these  problems. 
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